
computation of GELU and Softmax, a known algorithm [49] for the integer calculation of the square root is utilized to perform integer-only computation for LayerNorm. Finally, an integer-only framework is constructed by combining these approximations of GELU, Softmax, and LayerNorm. An illustration of I-BERT is presented on the right side of Fig. 5.4.

5.4.1 Integer-Only Computation of GELU and Softmax

For the integer-only computation of GELU and Softmax, I-BERT uses a class of interpolating polynomials to approximate each function. Given the function values at a set of n+1 distinct data points {(x0, f0), . . . , (xn, fn)}, the goal is to find a polynomial of degree at most n that exactly matches the function values at these points. The authors note that a unique polynomial of degree at most n passes through all the data points and give an analytic solution for the target polynomial.
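
The analytic solution referred to here is the classical Lagrange interpolating polynomial, L(x) = Σ_i f_i · Π_{j≠i} (x − x_j)/(x_i − x_j). The sketch below illustrates the idea in floating point with a degree-2 fit to GELU; the interpolation points are chosen purely for illustration and are not the ones used in I-BERT, which evaluates its fixed low-degree polynomial with integer arithmetic on quantized inputs.

```python
import math


def lagrange_poly(points):
    """Return a function evaluating the unique polynomial of degree
    at most n that passes through the n+1 given (x, f) data points."""
    xs = [p[0] for p in points]
    fs = [p[1] for p in points]

    def poly(x):
        total = 0.0
        for i, (xi, fi) in enumerate(zip(xs, fs)):
            # Lagrange basis: l_i(x) = prod_{j != i} (x - x_j) / (x_i - x_j)
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += fi * basis
        return total

    return poly


def gelu(x):
    # Exact GELU via the Gaussian CDF: GELU(x) = x * Phi(x).
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Degree-2 interpolation of GELU at three illustrative points
# (not the points chosen in I-BERT).
pts = [(-3.0, gelu(-3.0)), (-1.0, gelu(-1.0)), (1.0, gelu(1.0))]
approx = lagrange_poly(pts)
for x in (-2.0, 0.0, 0.5):
    print(f"x = {x:+.2f}   gelu = {gelu(x):+.4f}   poly = {approx(x):+.4f}")
```

With a degree-2 fit, the approximation error depends heavily on where the interpolation points are placed and over which input range the polynomial is applied, which is why the choice of points matters in practice.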

5.4.2 Integer-Only Computation of LayerNorm

For integer-only LayerNorm, the challenge is that the input statistics (i.e., μ and σ) change rapidly during training, so these values must be calculated dynamically at runtime. Computing μ is straightforward; however, evaluating σ requires the square-root function. To compute the square root with integer-only arithmetic, the authors adopt an efficient iterative algorithm proposed in [49]. Given any non-negative integer input n, the algorithm, based on Newton's method, iteratively searches for the exact value of ⌊√n⌋ and requires only integer arithmetic. The remaining non-linear operations in LayerNorm, such as division and squaring, are then computed directly with integer arithmetic.
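
As a concrete illustration, the sketch below implements a Newton-style integer square root in Python. The exact iteration of [49] adopted in I-BERT may differ in its initialization and stopping rule, but it relies on the same integer-only Newton update.

```python
def isqrt(n: int) -> int:
    """Integer square root: the largest k with k * k <= n.
    Newton-style iteration using only integer arithmetic (a generic
    variant; the algorithm of [49] may differ in its details)."""
    if n < 0:
        raise ValueError("input must be a non-negative integer")
    if n == 0:
        return 0
    x = n  # any starting guess >= floor(sqrt(n)) works
    while True:
        y = (x + n // x) // 2   # integer Newton update, rounded down
        if y >= x:              # no further decrease: x = floor(sqrt(n))
            return x
        x = y


# In integer-only LayerNorm the variance is accumulated as an integer,
# so the standard deviation can be obtained with isqrt instead of a
# floating-point square root.
print(isqrt(10))    # 3
print(isqrt(144))   # 12
```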

The integer-only quantization results for RoBERTa-Base/Large are presented in Table 5.3. As one can see, I-BERT consistently achieves comparable or slightly higher accuracy than the baseline. For RoBERTa-Base, I-BERT matches or exceeds the baseline accuracy in all cases (by up to 1.4 points on RTE), except for the MNLI-m, QQP, and STS-B tasks. A similar behavior can be observed on the RoBERTa-Large model, where I-BERT matches or outperforms the baseline accuracy on all the downstream tasks. On average, I-BERT outperforms the baseline by 0.3/0.5 points for RoBERTa-Base/Large, respectively.

TABLE 5.3
I-BERT quantization results for RoBERTa-Base and RoBERTa-Large on the development set of the GLUE benchmark. The baseline is trained from the pre-trained models, and I-BERT is quantized and fine-tuned from the baseline.

RoBERTa-Base

Method    Precision  MNLI-m  MNLI-mm  QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE   Avg.
Baseline  FP32       87.8    87.4     90.4  92.8  94.6   61.2  91.1   90.9  78.0  86.0
I-BERT    INT8       87.5    87.4     90.2  92.8  95.2   62.5  90.8   91.1  79.4  86.3
Diff                 -0.3     0.0     -0.2   0.0  +0.6   +1.3  -0.3   +0.2  +1.4  +0.3

RoBERTa-Large

Method    Precision  MNLI-m  MNLI-mm  QQP   QNLI  SST-2  CoLA  STS-B  MRPC  RTE   Avg.
Baseline  FP32       90.0    89.9     92.8  94.1  96.3   68.0  92.2   91.8  86.3  89.0
I-BERT    INT8       90.4    90.3     93.0  94.5  96.4   69.0  92.2   93.0  87.0  89.5
Diff                 +0.4    +0.4     +0.2  +0.4  +0.1   +1.0   0.0   +1.2  +0.7  +0.5